Identifying Trends From VT Tweets

Question

What trending topics occurred at VT over the last few days, and when did they occur?

Hypothesis

We hypothesize that the majority of tweets posted between November 6 and 14 will be about college football.

Assumptions

During the fall semester, football is a major topic among college students, and the number of tweets may increase around big games.

Description of Data

The data was provided by the VT instructor for this analysis.

Method

TF-IDF (term frequency-inverse document frequency)

At this stage, we add a 'bag' column to the tweets dataframe that holds each word appearing in the 'text' column.
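The 'bag' step can be sketched as follows. This is a minimal illustration, not the code used in the analysis: it uses a plain list of dictionaries as a stand-in for the tweets dataframe, and the sample rows and the helper name `make_bag` are hypothetical.

```python
# Hypothetical sample rows standing in for the tweets dataframe.
tweets = [
    {"date": "11/13/2021", "text": "Hokies win at home"},
    {"date": "11/14/2021", "text": "Coach speaks after the game"},
]


def make_bag(text):
    """Split one tweet into a list of lowercase words (hypothetical helper)."""
    return text.lower().split()


# Add a 'bag' entry to every row, mirroring the 'bag' column described above.
for row in tweets:
    row["bag"] = make_bag(row["text"])

print(tweets[0]["bag"])
```

Splitting on whitespace is the simplest possible tokenizer; punctuation is still attached to words at this point and is handled in the later cleaning step.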

Next, we aggregate the tweets by date in order to find the trending topics of each day for further analysis.
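A minimal sketch of the aggregation, assuming each tweet is available as a (date, text) pair. The sample tweets and the function name `aggregate_by_date` are hypothetical, and plain Python is used in place of the dataframe.

```python
from collections import defaultdict

# Hypothetical (date, text) pairs standing in for the tweets dataframe.
tweets = [
    ("11/13/2021", "Hokies win at home"),
    ("11/13/2021", "Great game today"),
    ("11/14/2021", "Coach speaks after the game"),
]


def aggregate_by_date(rows):
    """Join the text of all tweets that share a date into one string."""
    grouped = defaultdict(list)
    for date, text in rows:
        grouped[date].append(text)
    return {date: " ".join(texts) for date, texts in grouped.items()}


daily_text = aggregate_by_date(tweets)
print(daily_text["11/13/2021"])
```

After this step each date acts as one document, which is the unit that the later tf and idf calculations work on.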

Since the tweets are now aggregated by date, we can clean the 'text' column by removing punctuation that could otherwise distort the results.

Note that during the cleaning of 'text', we also remove stop words using the 'stopwords.txt' file provided in class by the instructor.
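The two cleaning steps might look like the sketch below. The stop-word set here is a small hypothetical sample; in the analysis the list comes from 'stopwords.txt', whose contents are not shown in this report. The function name `clean_text` is also an assumption.

```python
import string

# Hypothetical stop words; the analysis loads the real list from 'stopwords.txt'.
STOP_WORDS = {"the", "a", "at", "is", "and"}


def clean_text(text, stop_words):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [word for word in words if word not in stop_words]


cleaned = clean_text("The Hokies win at home, and the coach is happy!", STOP_WORDS)
print(cleaned)
```

One side effect of removing punctuation this way is that a URL collapses into a single token: `clean_text("https://t.co/abc", STOP_WORDS)` yields `["httpstcoabc"]`. This may be how URL-like tokens end up in the bag.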

Finally, after all the cleaning, we have reached the stage where we can create one large bag of tweet text for each date, which is used to calculate the term frequency (tf).

We notice that the largest number of words in tweets (approximately 35,000) appeared between 11/12/2021 and 11/14/2021. We also note that the word count dropped sharply after 11/14/2021, to fewer than 5,000 (close to 500) on 11/15/2021.

Now that the tweets bag is ready, we can calculate the term frequency of each word in the bag.
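A minimal sketch of the term frequency calculation. It assumes tf is the count of a word divided by the total number of words in that date's bag; the report does not state which tf variant was used, so raw counts would be an equally plausible choice. The sample bag and the function name `term_frequency` are hypothetical.

```python
from collections import Counter


def term_frequency(bag):
    """Return word -> share of the bag taken up by that word (assumed tf definition)."""
    counts = Counter(bag)
    total = len(bag)
    return {word: count / total for word, count in counts.items()}


# Hypothetical cleaned bag for a single date.
bag = ["hokies", "coach", "hokies", "tech"]
tf = term_frequency(bag)
print(tf)
```

Dividing by the bag size keeps values comparable between dates with very different tweet volumes, which matters here because the daily word counts vary widely.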

We notice that the words appearing most often across all the given dates are tech, virginia, hokies, and coach. Their high term frequency values indicate that people used these terms frequently when tweeting.

To move ahead, we can now calculate the inverse document frequency (idf).
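A minimal sketch of the idf calculation, treating each date as one document. It assumes idf(word) = ln(N / df), where N is the number of dates and df is the number of dates on which the word appears; the log base and any smoothing used in the analysis are not stated, so this is an assumption. The sample bags and the function name are hypothetical.

```python
import math


def inverse_document_frequency(bags_by_date):
    """Return word -> ln(number of dates / number of dates containing the word)."""
    n_docs = len(bags_by_date)
    doc_freq = {}
    for bag in bags_by_date.values():
        for word in set(bag):  # count each word at most once per date
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Hypothetical cleaned bags, one per date.
bags = {
    "11/11/2021": ["hokies", "veteransday"],
    "11/12/2021": ["hokies", "coach"],
    "11/13/2021": ["hokies", "game"],
}
idf = inverse_document_frequency(bags)
print(idf)
```

Under this definition a word that appears on every date gets an idf of 0, and a word seen on only one date gets the largest value, ln(N).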

From the visualization above, we note that more than 6,000 words have an idf value greater than 2.

Now that both tf and idf have been evaluated, we can calculate TF-IDF. A non-zero TF-IDF score means the word is more significant for that date, while a score of 0 means it is not significant (the word appeared on all the dates).
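Putting the two pieces together, a TF-IDF score per word and date can be sketched as below. It assumes tf is the relative frequency within a date's bag and idf is ln(N / df); both definitions are assumptions, as are the sample bags and the function name `tf_idf`.

```python
import math
from collections import Counter


def tf_idf(bags_by_date):
    """Return date -> {word: tf * idf}, with each date treated as one document."""
    n_docs = len(bags_by_date)
    doc_freq = Counter()
    for bag in bags_by_date.values():
        doc_freq.update(set(bag))  # number of dates on which each word appears

    scores = {}
    for date, bag in bags_by_date.items():
        counts = Counter(bag)
        total = len(bag)
        scores[date] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical cleaned bags, one per date.
bags = {
    "11/11/2021": ["hokies", "veteransday", "veteransday"],
    "11/13/2021": ["hokies", "game"],
}
scores = tf_idf(bags)
top_word = max(scores["11/11/2021"], key=scores["11/11/2021"].get)
print(top_word, scores["11/11/2021"])
```

In this toy data "hokies" appears on both dates, so its score is 0 on each, while "veteransday" appears on a single date and ranks highest there.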

We note that words like transmission, ct, u, loses, veteransday, dakota, undervalued, httpstockssaqecebr, and EED were much more significant on the dates shown in the above cell, meaning they appeared rarely or not at all outside the date on which they were tweeted. It is unclear what tokens like 'ct', 'u', 'EED', and 'https....' actually mean.

Distinct events such as veteransday are among the meaningful trending topics that occurred at various times during the period.